Intrinsic Plagiarism Detection using Complexity Analysis
نویسندگان
چکیده
We introduce Kolmogorov Complexity measures as a way of extracting structural information from texts for Intrinsic Plagiarism Detection. Kolmogorov complexity measures have been used as features in a variety of machine learning tasks including image recognition, radar signal classification, EEG classification, DNA analysis, speech recognition and some text classification tasks (Chi and Kong, 1998; Zhang, Hu, and Jin, 2003; Bhattacharya, 2000; Menconi, Benci, and Buiatti, 2008; Frank, Chui, and Witten, 2000; Dalkilic et al., 2006; Seaward and Saxton, 2007; Seaward, Inkpen, and Nayak, 2008). Intrinsic Plagiarism detection uses no external corpus for document comparison and thus plagiarism must be detected solely on the basis of style shifts within the text to be analyzed. Given the small amount of text to be analyzed, feature extraction is of particular importance. We give a theoretical background as to why complexity measures are meaningful and we introduce some experimental results on the PAN’09 Intrinsic Plagiarism Corpus. We show complexity features based on the Lempel-Ziv compression algorithm slightly increase performance over features based on normalized counts. Furthermore we believe that more sophisticated compression algorithms which are suited to compressing the English language show great promise for feature extraction for various text classification problems.
منابع مشابه
Approaches for Intrinsic and External Plagiarism Detection - Notebook for PAN at CLEF 2011
Plagiarism detection has been considered as a classification problem which can be approximated with intrinsic strategies, considering self-based information from a given document, and external strategies, considering comparison techniques between a suspicious document and different sources. In this work, both intrinsic and external approaches for plagiarism detection are presented. First, the m...
متن کاملExternal and Intrinsic Plagiarism Detection Using Vector Space Models
Plagiarism detection can be divided in external and intrinsic methods. Naive external plagiarism analysis suffers from computationally demanding full nearest neighbor searches within a reference corpus. We present a conceptually simple space partitioning approach to achieve search times sub linear in the number of reference documents, trading precision for speed. We focus on full duplicate sear...
متن کاملRDI System for Intrinsic Plagiarism Detection (RDI_RID), Working Notes for PANAraPlagDet at FIRE 2015
Many researchers have been investigating the task of plagiarism detection lately. In this paper we present RDI system for intrinsic plagiarism detection (RDI_RID). RDI_RID system was the only system that participated in intrinsic track of the Arabic language plagiarism detection competition. RDI_RID system achieved a PlagDet (Plagiarism Detection score) of 19% compared to 38% achieved by the ba...
متن کاملIntrinsic Plagiarism Detection Using Character n-gram Profiles
The task of intrinsic plagiarism detection deals with cases where no reference corpus is available and it is exclusively based on stylistic changes or inconsistencies within a given document. In this paper a new method is presented that attempts to quantify the style variation within a document using character n-gram profiles and a style change function based on an appropriate dissimilarity mea...
متن کاملIntrinsic Plagiarism Detection using N-gram Classes
When it is not possible to compare the suspicious document to the source document(s) plagiarism has been committed from, the evidence of plagiarism has to be looked for intrinsically in the document itself. In this paper, we introduce a novel languageindependent intrinsic plagiarism detection method which is based on a new text representation that we called n-gram classes. The proposed method w...
متن کامل